Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

JIT: fix filter liveness computation #23044

Merged
merged 2 commits into from
Mar 7, 2019
Merged

Conversation

AndyAyersMS
Copy link
Member

When a filter is finished executing, control can logically pass to the
associated handler, any enclosing handler or filter, or any finally or fault
handler nested within the associated try. This is a consequence of two-pass EH.

The jit was not propagating liveness from the nested handlers, which lead to a
live object being collected inadvertently.

This change updates fgGetHandlerLiveVars to find the nested handlers and
merge their live-in into the filter block live sets.

Because these implicit EH flow edges can create cycles in the liveness dataflow
equations, the jit will also now always iterate liveness when it sees there is
exception flow, to ensure livness reaches the appropriate fixed point.

Added test case.

Closes #22820.

When a filter is finished executing, control can logically pass to the
associated handler, any enclosing handler or filter, or any finally or fault
handler nested within the associated try. This is a consequence of two-pass EH.

The jit was not propagating liveness from the nested handlers, which lead to a
live object being collected inadvertently.

This change updates `fgGetHandlerLiveVars` to find the nested handlers and
merge their live-in into the filter block live sets.

Because these implicit EH flow edges can create cycles in the liveness dataflow
equations, the jit will also now always iterate liveness when it sees there is
exception flow, to ensure livness reaches the appropriate fixed point.

Added test case.

Closes #22820.
@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Mar 5, 2019

@BruceForstall PTAL
cc @dotnet/jit-contrib @janvorli

No diffs on x64 (with --gcinfo). [there were diffs; see below]

Diffs on x86 on the test case and in a handful of places in corelib and fx.

PMI Diffs for System.Private.CoreLib.dll, framework assemblies for x86 default jit
Summary:
(Lower is better)
Total bytes of diff: 16 (0.00% of base)
    diff is a regression.
Top file regressions by size (bytes):
          10 : Microsoft.CodeAnalysis.dasm (0.00% of base)
           3 : System.Security.Cryptography.Algorithms.dasm (0.00% of base)
           3 : System.Security.Cryptography.Cng.dasm (0.00% of base)
3 total files with size differences (0 improved, 3 regressed), 126 unchanged.
Top method regressions by size (bytes):
           5 ( 0.18% of base) : Microsoft.CodeAnalysis.dasm - <ComputeAnalyzerDiagnosticsAsync>d__51:MoveNext():this
           5 ( 0.46% of base) : Microsoft.CodeAnalysis.dasm - <SetActiveTreeAnalysisTaskAsync>d__58:MoveNext():this
           3 ( 0.16% of base) : System.Security.Cryptography.Algorithms.dasm - System.Security.Cryptography.RsaPaddingProcessor:PadOaep(struct,struct):this
           3 ( 0.16% of base) : System.Security.Cryptography.Cng.dasm - System.Security.Cryptography.RsaPaddingProcessor:PadOaep(struct,struct):this
Top method regressions by size (percentage):
           5 ( 0.46% of base) : Microsoft.CodeAnalysis.dasm - <SetActiveTreeAnalysisTaskAsync>d__58:MoveNext():this
           5 ( 0.18% of base) : Microsoft.CodeAnalysis.dasm - <ComputeAnalyzerDiagnosticsAsync>d__51:MoveNext():this
           3 ( 0.16% of base) : System.Security.Cryptography.Algorithms.dasm - System.Security.Cryptography.RsaPaddingProcessor:PadOaep(struct,struct):this
           3 ( 0.16% of base) : System.Security.Cryptography.Cng.dasm - System.Security.Cryptography.RsaPaddingProcessor:PadOaep(struct,struct):this
4 total methods with size differences (0 improved, 4 regressed), 225979 unchanged.
System.Private.CoreLib.dasm had 2616 diffs
NuGet.Protocol.Core.v3.dasm had 156 diffs
System.Net.WebSockets.dasm had 13 diffs

Corelib text diffs seem to be an interleaving artifact. Looked at random sampling of the other diffs and all the methods with diffs had a filter with a nested try/finally. For example:

https://github.com/dotnet/corefx/blob/f7539b726c4bc2385b7f49e5751c1cff2f2c7368/src/Common/src/System/Net/WebSockets/ManagedWebSocket.cs#L442-L464

Size regressions are from extra prolog inits.

Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. One thing that might be helpful is to add an example in the comments, to make it easier to understand the kind of EH structure that we're dealing with here.

@AndyAyersMS
Copy link
Member Author

Oops, forgot to have the new test case return 100 on success.

Copy link

@briansull briansull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks Good

@AndyAyersMS
Copy link
Member Author

Actually I was wrong about no diffs on x64 (forgot to rebuild the new jit). Actual diffs are similar to those on x86:

PMI Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary:
(Lower is better)
Total bytes of diff: 18 (0.00% of base)
    diff is a regression.
Top file regressions by size (bytes):
          14 : Microsoft.CodeAnalysis.dasm (0.00% of base)
           4 : NuGet.Protocol.Core.v3.dasm (0.00% of base)
2 total files with size differences (0 improved, 2 regressed), 127 unchanged.
Top method regressions by size (bytes):
           9 ( 0.29% of base) : Microsoft.CodeAnalysis.dasm - <ComputeAnalyzerDiagnosticsAsync>d__51:MoveNext():this
           5 ( 0.37% of base) : Microsoft.CodeAnalysis.dasm - <SetActiveTreeAnalysisTaskAsync>d__58:MoveNext():this
           4 ( 0.24% of base) : NuGet.Protocol.Core.v3.dasm - <OpenNupkgStreamAsyncCore>d__26:MoveNext():this
Top method regressions by size (percentage):
           5 ( 0.37% of base) : Microsoft.CodeAnalysis.dasm - <SetActiveTreeAnalysisTaskAsync>d__58:MoveNext():this
           9 ( 0.29% of base) : Microsoft.CodeAnalysis.dasm - <ComputeAnalyzerDiagnosticsAsync>d__51:MoveNext():this
           4 ( 0.24% of base) : NuGet.Protocol.Core.v3.dasm - <OpenNupkgStreamAsyncCore>d__26:MoveNext():this
3 total methods with size differences (0 improved, 3 regressed), 193918 unchanged.
5 files had text diffs but not size diffs.
System.Private.CoreLib.dasm had 720 diffs
System.Security.Cryptography.Algorithms.dasm had 433 diffs
System.Security.Cryptography.Cng.dasm had 433 diffs
System.Net.WebClient.dasm had 84 diffs
System.Net.WebSockets.dasm had 78 diffs

File line diff count is high because GC info doesn't diff very well. Corelib file diff again appears to be an interleaving issue; something has changed recently that breaks jit-diff determinism, or maybe it's an issue with the --gcinfo variant that I don't use all that often.

@AndyAyersMS
Copy link
Member Author

Ubuntu test builds failing in msbuild with errors like:

System.IO.IOException: Broken pipe ---> System.Net.Sockets.SocketException: Broken pipe
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error)

@dotnet-bot retest Ubuntu x64 Checked Innerloop Build and Test
@dotnet-bot retest Ubuntu x64 Checked Innerloop Build and Test (Jit - TieredCompilation=0)

OSX with its usual hangup issue.

@dotnet-bot retest OSX10.12 x64 Checked Innerloop Build and Test

Throwing in some GC stress legs...

@dotnet-bot test Windows_NT x86 Checked gcstress0xc
@dotnet-bot test Windows_NT x86 Checked gcstress0xc_jitstress1
@dotnet-bot test Windows_NT x86 Checked gcstress0xc_jitstress2
@dotnet-bot test Windows_NT x64 Checked gcstress0xc

@janvorli
Copy link
Member

janvorli commented Mar 6, 2019

The change doesn't look right for non-x86 (tried on x64). It does what I was afraid of - it causes double reporting of the problematic variable during GC scan. As i've mentioned in https://github.com/dotnet/coreclr/issues/22820#issuecomment-469859758, the variable is not expected to be reported by the GC info as live in the filter, since when the filter runs, we have the frame that has thrown the exception on stack and so we report it during the stack walk twice - once from the filter frame and second time from the frame of the throw.
Moreover, the GC info now contains records for the stack slot twice (slots 0 and 1). Once reported as pinned (?) in the filter and once as not pinned out of the filter. See from the jitdump of the simple repro:

Stack slot id for offset -16 (0xfffffff0) (frame) (pinned) = 0.
Stack slot id for offset -16 (0xfffffff0) (frame) = 1.
Register slot id for reg rax = 2.
Register slot id for reg rsi = 3.
Register slot id for reg rcx = 4.
Register slot id for reg rdx = 5.
Set state of slot 0 at instr offset 0x82 to Live.
Set state of slot 0 at instr offset 0xc4 to Dead.
Set state of slot 1 at instr offset 0x15 to Live.
Set state of slot 1 at instr offset 0x56 to Dead.
Set state of slot 1 at instr offset 0x5e to Live.
Set state of slot 1 at instr offset 0x82 to Dead.

The stack walker for non windows-x86 has an extremely complicated logic full of assumptions about what GC info does report and what it doesn't in funclets and filters and their parent frames. So while we could possibly modify it to reflect this new way of reporting liveness and prevent the double reporting, I would like to avoid touching that code as much as I can. Bugs in this code usually result in very hard to reproduce and debug GC holes.

So I would prefer modifying the change to be x86 windows only (not even x86 Unix as that one uses the same EH as x64, arm and arm64 everywhere, the x86 windows is an exception).

@jkotas do you have any opinion here?

@janvorli
Copy link
Member

janvorli commented Mar 6, 2019

Here is a GC stack walk log on x64 showing the double reporting (see Reporting frame sp-0x10)

TID 1610: condemned generation num: 2
TID 1610: { =========== BEGINGC 1, (requested generation = 2, collect_classes = 1) ==========
TID 1610: GCScan: Promotion Phase = 1
TID 1610: { Starting scan of Thread 0000023027567F70 ID = 1
TID 1610: Scanning ExplicitFrame 0000006660979510 AssocMethod = 00007FFEB9365A60M frameVTable = 00007FFF18EE8DC0V
TID 1610: Scanning ExplicitFrame 0000006660979510 AssocMethod = 00007FFEB9365A60M frameVTable = 00007FFF18EE8DC0V
TID 1610: Scanning Frameless method 00007FFEB9498370M ControlPC = 00007FFEB91F8FB8
TID 1610: Scanning Frame for method ILStubClass:IL_STUB_PInvoke
TID 1610: Scanning Frameless method 00007FFEB9365D58M ControlPC = 00007FFF1609AAFB
TID 1610: Scanning Frame for method System.GC:Collect
TID 1610: Scanning Frameless method 00007FFEB93831D8M ControlPC = 00007FFEB91F5F0F
TID 1610: Scanning Frame for method SimpleAyendeRepro.Program:IsExpectedException
TID 1610: Reporting r6 at 00000066`60979580 as base pointer  00000230`2965ee10 System.Exception
TID 1610: Returning 0x000000666097CF80 as the parent stack frame for filter funclet 0x0000006660979660
TID 1610: STACKWALK: Found Filter funclet @ SP: 0000006660979620, m_crawl.pFunc = 00007FFEB9383248; FuncletParentCallerSP: 000000666097CF80
TID 1610: Scanning Frameless method 00007FFEB9383248M ControlPC = 00007FFEB91F5A15
TID 1610: Scanning Frame for method SimpleAyendeRepro.Program:Main
TID 1610: "Live" scratch register r1  not reported
TID 1610: Reporting frame sp-0x10 at 00000066`6097cf60 as pinned base pointer  00000230`2965edf8 SimpleAyendeRepro.DisposableObject
TID 1610: Scanning ExplicitFrame 000000666097CC28 AssocMethod = 0000000000000000M frameVTable = 00007FFF18EAC100V
TID 1610: STACKWALK: Reached parent of filter funclet @ CallerSP: 000000666097CF80, m_crawl.pFunc = 00007FFEB9383248
TID 1610: Scanning Frameless method 00007FFEB9383248M ControlPC = 00007FFEB91F59B5
TID 1610: Scanning Frame for method SimpleAyendeRepro.Program:Main
TID 1610: "Live" scratch register r1  not reported
TID 1610: Reporting r6 at 00000066`6097cc78 as base pointer  00000230`2965ee10 System.Exception
TID 1610: Reporting frame sp-0x10 at 00000066`6097cf60 as base pointer  00000230`2965edf8 SimpleAyendeRepro.DisposableObject
TID 1610: Scanning ExplicitFrame 000000666097D6F8 AssocMethod = 0000000000000000M frameVTable = 00007FFF18F28370V
TID 1610: Scanning ExplicitFrame 000000666097E090 AssocMethod = 0000000000000000M frameVTable = 00007FFF18F28370V
TID 1610: Ending scan of Thread 0000023027567F70 ID = 0x1 }
TID 1610: { Starting scan of Thread 0000023027599A80 ID = 2
TID 1610: Scanning ExplicitFrame 00000066610FF930 AssocMethod = 0000000000000000M frameVTable = 00007FFF18EC7DE0V
TID 1610: Ending scan of Thread 0000023027599A80 ID = 0x2 }
TID 1610: { Starting scan of Thread 00000230294C9270 ID = 3
TID 1610: Ending scan of Thread 00000230294C9270 ID = 0x3 }
TID 1610: GcScanHandles (Promotion Phase = 1)
TID 1610: ========== ENDGC 1 (gen = 2, collect_classes = 1) ===========}

@janvorli
Copy link
Member

janvorli commented Mar 6, 2019

I've spent more time debugging on x86 and I have realized that I got mislead by the fact that the "k" windbg command was not showing the frame where the "throw" happened. It turns out the frame is still there and our stack walker still walks it.
So it seems to me now that we may be able to actually fix the issue in the runtime without touching the JIT. The stack walker code for x86 was written with the assumption that the filter frame reports the variables of the parent frame (the one that has thrown the exception), which as we've seen was true for the old JIT32. Now with the new x86 JIT behaving the same way as the x64 one and not reporting those locals, it seems we may be able to change the x86 part of the stack walker to not to handle the filter frame in a special way and it seems it would fix the problem too.

However, I am a bit torn here, as your current change that reports the variables in the filter on all platforms is actually better tracking the object lifetime reality. So it seems more "correct". Moreover, I've found that the double reporting on the other platforms doesn't cause trouble thanks to the fact that it is reported as "pinned" for the filter. It seems it just adds some overhead.

@BruceForstall
Copy link
Member

fyi, there is a discussion of GC, filters, and pinning here: https://github.com/dotnet/coreclr/blob/master/Documentation/botr/clr-abi.md#filter-gc-semantics

@AndyAyersMS
Copy link
Member Author

There was already an edge case for extending liveness reporting in filters in funclet EH models:

coreclr/src/jit/liveness.cpp

Lines 1071 to 1082 in 5a26509

if (HBtab->HasFilter())
{
VarSetOps::UnionD(this, liveVars, HBtab->ebdFilter->bbLiveIn);
#if FEATURE_EH_FUNCLETS
// The EH subsystem can trigger a stack walk after the filter
// has returned, but before invoking the handler, and the only
// IP address reported from this method will be the original
// faulting instruction, thus everything in the try body
// must report as live any variables live-out of the filter
// (which is the same as those live-in to the handler)
VarSetOps::UnionD(this, liveVars, HBtab->ebdHndBeg->bbLiveIn);
#endif // FEATURE_EH_FUNCLETS

And there is this bit of compensating logic in the jit GC encoder:

// gcMarkFilterVarsPinned - Walk all lifetimes and make it so that anything
// live in a filter is marked as pinned (often by splitting the lifetime
// so that *only* the filter region is pinned). This should only be
// called once (after generating all lifetimes, but before slot ids are
// finalized.
//
// DevDiv 376329 - The VM has to double report filters and their parent frame
// because they occur during the 1st pass and the parent frame doesn't go dead
// until we start unwinding in the 2nd pass.
//
// Untracked locals will only be reported in non-filter funclets and the
// parent.
// Registers can't be double reported by 2 frames since they're different.
// That just leaves stack variables which might be double reported.
//
// Technically double reporting is only a problem when the GC has to relocate a
// reference. So we avoid that problem by marking all live tracked stack
// variables as pinned inside the filter. Thus if they are double reported, it
// won't be a problem since they won't be double relocated.
//
void GCInfo::gcMarkFilterVarsPinned()
{

So the pinning is evidently the thing that makes the double reporting work out.

@janvorli
Copy link
Member

janvorli commented Mar 6, 2019

@BruceForstall thank you for the pointer to the doc. Then this change is definitely correcting the existing state according to the doc and should go in as is.

@AndyAyersMS AndyAyersMS merged commit f6cc013 into dotnet:master Mar 7, 2019
@AndyAyersMS AndyAyersMS deleted the Fix22820 branch March 7, 2019 18:52
@janvorli
Copy link
Member

janvorli commented Mar 7, 2019

@AndyAyersMS I think we should port this to 2.1 (and 2.2 unless changes from 2.1 flow automatically to 2.2).

@AndyAyersMS
Copy link
Member Author

I think we should port this to 2.1

I was thinking along those lines too. Let me look into it.

AndyAyersMS added a commit to AndyAyersMS/coreclr that referenced this pull request Mar 8, 2019
Port of dotnet#23044 to release/2.1.

When a filter is finished executing, control can logically pass to the
associated handler, any enclosing handler or filter, or any finally or fault
handler nested within the associated try. This is a consequence of two-pass EH.

The jit was not propagating liveness from the nested handlers, which lead to a
live object being collected inadvertently.

This change updates `fgGetHandlerLiveVars` to find the nested handlers and
merge their live-in into the filter block live sets.

Because these implicit EH flow edges can create cycles in the liveness dataflow
equations, the jit will also now always iterate liveness when it sees there is
exception flow, to ensure livness reaches the appropriate fixed point.

Added test case.

Closes #22820.
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
When a filter is finished executing, control can logically pass to the
associated handler, any enclosing handler or filter, or any finally or fault
handler nested within the associated try. This is a consequence of two-pass EH.

The jit was not propagating liveness from the nested handlers, which lead to a
live object being collected inadvertently.

This change updates `fgGetHandlerLiveVars` to find the nested handlers and
merge their live-in into the filter block live sets.

Because these implicit EH flow edges can create cycles in the liveness dataflow
equations, the jit will also now always iterate liveness when it sees there is
exception flow, to ensure livness reaches the appropriate fixed point.

Added test case.

Closes dotnet/coreclr#22820.


Commit migrated from dotnet/coreclr@f6cc013
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Access Violation on x86
5 participants